Our project investigates the attributes that contribute to the successful relationship between a country and its citizens, focusing on social well-being and mental health. In this project, we aim to investigate how a country's happiness score, a subjective notion, can be influenced by objective factors such as country population, density, and education inequality. We will also analyze the correlation between a country's population and educational inequality. This project is also important as it pinpoints specific areas where countries can improve based on the perspective of their citizens. Effective use of this information can lead to improved living conditions, and social cohesion, and promote sustainable development.
At the core of this project, our objective is to provide the public with valuable insights into the well-being of nations and their citizens, helping policymakers understand the factors that contribute to an individual’s happiness. Such insights can identify trends, disparities, and successful policies that can be shared and adapted by other countries to improve their citizen’s quality of life. In addition, this information can be extremely helpful to those who are considering moving to a different country and want to evaluate the different kinds of factors that promote liveability. For example, someone who comes from a country with a low social support score may have a higher priority in that attribute and want to move to a different country to meet that specific need.
Our chosen datasets originate from the World Happiness Report, Worldometer, and the Human Development Report, which explore social well-being in terms of personal happiness, a country’s population, and a country’s education inequality respectively. These cleaned CSV datasets are organized in a tabular format and are free from unintentional missing or duplicate values. These datasets were directly shared with and downloaded from Kaggle.com, and are therefore available for public use. We will conjoin these tables together to uncover underlying relationships and patterns between fields that was not possible if these datasets were alone. To complete this task, we will first:
# Libraries used:
import pandas as pd
import sqlalchemy as sq
import requests
### Read CSV as dataframe tables: ###
# Multiples of the same datasets were uploaded as each member did their investigation
# separately and differently to get our final results.
## 1. Happiness Report dataset: ##
# Happiness table used for Question 1.
url = "https://github.com/boli3ucalgary/604/raw/main/DataForTable2.1WHR2023.xls"
response = requests.get(url)
with open("DataForTable2.1WHR2023.xls", "wb") as file:
file.write(response.content)
happyreport = pd.read_excel("DataForTable2.1WHR2023.xls")
%pip install openpyxl
# Happiness table used for Question 2.
score = pd.read_csv("WHR20_DataForFigure2.1.csv")
# Happiness table used for Question 3.
happiness = pd.read_csv("WHR20_DataForFigure2.1.csv")
## 2. Population 2020 dataset ##
# Population table used for Question 2.
populations = pd.read_csv("population_by_country_2020.csv", na_values= "N.A.")
populations.columns = ["Country", "Population", "Growth", "Net Change", "Density", "Area", "Migrants", "Fertility Rate", "Median Age", "Urban Population", "World Share"]
# Population table used for Question 3.
population2020 = pd.read_csv("population_by_country_2020.csv")
# Needed to change column name as a Programming Error would occur when calling for this dataset
population2020.rename(columns = {'Urban Pop %':'Urban Pop Percentage'}, inplace = True)
## 3. Inequality Education dataset: ##
# Education Inequality table used for Question 1.
Inequalityreport = pd.read_csv("Inequality in Education.csv")
# Education Inequality table used for Question 3.
inequality_ed = pd.read_csv("Inequality in Education.csv")
# Connect to your database; include a cell at the bottom of this notebook to dispose of your engine object
sq.__version__
engine = sq.create_engine('mysql+mysqlconnector://l01-10:Ois04E11CcdNr@datasciencedb2.ucalgary.ca/l01-10')
Requirement already satisfied: openpyxl in /opt/conda/lib/python3.10/site-packages (3.0.10) Requirement already satisfied: et_xmlfile in /opt/conda/lib/python3.10/site-packages (from openpyxl) (1.0.1) Note: you may need to restart the kernel to use updated packages.
# Run Engine - push data tables to the cloud (only need to do once)
happiness.to_sql('happiness', engine, if_exists='replace' )
153
population2020.to_sql('population2020', engine , if_exists='replace')
235
inequality_ed.to_sql('education', engine , if_exists='replace')
195
happyreport.to_sql('happy', engine, if_exists='replace')
2199
Inequalityreport.to_sql('inequality', engine, if_exists='replace')
195
score.to_sql('happiness_db', engine, if_exists='replace')
153
populations.to_sql('population_db', engine, if_exists='replace')
235
# Call table of happiness database
happiness_df = pd.read_sql_table("happiness", engine)
happiness_df.head(10)
| index | Country name | Regional indicator | Ladder score | Standard error of ladder score | upperwhisker | lowerwhisker | Logged GDP per capita | Social support | Healthy life expectancy | ... | Generosity | Perceptions of corruption | Ladder score in Dystopia | Explained by: Log GDP per capita | Explained by: Social support | Explained by: Healthy life expectancy | Explained by: Freedom to make life choices | Explained by: Generosity | Explained by: Perceptions of corruption | Dystopia + residual | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Finland | Western Europe | 7.8087 | 0.031156 | 7.869766 | 7.747634 | 10.639267 | 0.954330 | 71.900825 | ... | -0.059482 | 0.195445 | 1.972317 | 1.285190 | 1.499526 | 0.961271 | 0.662317 | 0.159670 | 0.477857 | 2.762835 |
| 1 | 1 | Denmark | Western Europe | 7.6456 | 0.033492 | 7.711245 | 7.579955 | 10.774001 | 0.955991 | 72.402504 | ... | 0.066202 | 0.168489 | 1.972317 | 1.326949 | 1.503449 | 0.979333 | 0.665040 | 0.242793 | 0.495260 | 2.432741 |
| 2 | 2 | Switzerland | Western Europe | 7.5599 | 0.035014 | 7.628528 | 7.491272 | 10.979933 | 0.942847 | 74.102448 | ... | 0.105911 | 0.303728 | 1.972317 | 1.390774 | 1.472403 | 1.040533 | 0.628954 | 0.269056 | 0.407946 | 2.350267 |
| 3 | 3 | Iceland | Western Europe | 7.5045 | 0.059616 | 7.621347 | 7.387653 | 10.772559 | 0.974670 | 73.000000 | ... | 0.246944 | 0.711710 | 1.972317 | 1.326502 | 1.547567 | 1.000843 | 0.661981 | 0.362330 | 0.144541 | 2.460688 |
| 4 | 4 | Norway | Western Europe | 7.4880 | 0.034837 | 7.556281 | 7.419719 | 11.087804 | 0.952487 | 73.200783 | ... | 0.134533 | 0.263218 | 1.972317 | 1.424207 | 1.495173 | 1.008072 | 0.670201 | 0.287985 | 0.434101 | 2.168266 |
| 5 | 5 | Netherlands | Western Europe | 7.4489 | 0.027792 | 7.503372 | 7.394428 | 10.812712 | 0.939139 | 72.300919 | ... | 0.207612 | 0.364717 | 1.972317 | 1.338946 | 1.463646 | 0.975675 | 0.613626 | 0.336318 | 0.368570 | 2.352117 |
| 6 | 6 | Sweden | Western Europe | 7.3535 | 0.036234 | 7.424519 | 7.282481 | 10.758794 | 0.926311 | 72.600769 | ... | 0.111615 | 0.250880 | 1.972317 | 1.322235 | 1.433348 | 0.986470 | 0.650298 | 0.272828 | 0.442066 | 2.246299 |
| 7 | 7 | New Zealand | North America and ANZ | 7.2996 | 0.039465 | 7.376953 | 7.222248 | 10.500943 | 0.949119 | 73.202629 | ... | 0.191598 | 0.221139 | 1.972317 | 1.242318 | 1.487218 | 1.008138 | 0.646790 | 0.325726 | 0.461268 | 2.128108 |
| 8 | 8 | Austria | Western Europe | 7.2942 | 0.033365 | 7.359596 | 7.228804 | 10.742824 | 0.928046 | 73.002502 | ... | 0.085429 | 0.499955 | 1.972317 | 1.317286 | 1.437445 | 1.000934 | 0.603369 | 0.255510 | 0.281256 | 2.398446 |
| 9 | 9 | Luxembourg | Western Europe | 7.2375 | 0.030852 | 7.297970 | 7.177031 | 11.450681 | 0.906912 | 72.599998 | ... | -0.004621 | 0.367084 | 1.972317 | 1.536676 | 1.387528 | 0.986443 | 0.610137 | 0.195954 | 0.367041 | 2.153700 |
10 rows × 21 columns
# Call table of population database
population_df = pd.read_sql_table("population2020", engine)
population_df.head(10)
| index | Country (or dependency) | Population (2020) | Yearly Change | Net Change | Density (P/Km²) | Land Area (Km²) | Migrants (net) | Fert. Rate | Med. Age | Urban Pop Percentage | World Share | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | China | 1440297825 | 0.39 % | 5540090 | 153 | 9388211 | -348399.0 | 1.7 | 38 | 61 % | 18.47 % |
| 1 | 1 | India | 1382345085 | 0.99 % | 13586631 | 464 | 2973190 | -532687.0 | 2.2 | 28 | 35 % | 17.70 % |
| 2 | 2 | United States | 331341050 | 0.59 % | 1937734 | 36 | 9147420 | 954806.0 | 1.8 | 38 | 83 % | 4.25 % |
| 3 | 3 | Indonesia | 274021604 | 1.07 % | 2898047 | 151 | 1811570 | -98955.0 | 2.3 | 30 | 56 % | 3.51 % |
| 4 | 4 | Pakistan | 221612785 | 2.00 % | 4327022 | 287 | 770880 | -233379.0 | 3.6 | 23 | 35 % | 2.83 % |
| 5 | 5 | Brazil | 212821986 | 0.72 % | 1509890 | 25 | 8358140 | 21200.0 | 1.7 | 33 | 88 % | 2.73 % |
| 6 | 6 | Nigeria | 206984347 | 2.58 % | 5175990 | 226 | 910770 | -60000.0 | 5.4 | 18 | 52 % | 2.64 % |
| 7 | 7 | Bangladesh | 164972348 | 1.01 % | 1643222 | 1265 | 130170 | -369501.0 | 2.1 | 28 | 39 % | 2.11 % |
| 8 | 8 | Russia | 145945524 | 0.04 % | 62206 | 9 | 16376870 | 182456.0 | 1.8 | 40 | 74 % | 1.87 % |
| 9 | 9 | Mexico | 129166028 | 1.06 % | 1357224 | 66 | 1943950 | -60000.0 | 2.1 | 29 | 84 % | 1.65 % |
# Call table of education inequality database
inequality_ed_df = pd.read_sql_table("education", engine)
inequality_ed_df.head(10)
| index | ISO3 | Country | Human Development Groups | UNDP Developing Regions | HDI Rank (2021) | Inequality in Education (2010) | Inequality in Education (2011) | Inequality in Education (2012) | Inequality in Education (2013) | Inequality in Education (2014) | Inequality in Education (2015) | Inequality in Education (2016) | Inequality in Education (2017) | Inequality in Education (2018) | Inequality in Education (2019) | Inequality in Education (2020) | Inequality in Education (2021) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | AFG | Afghanistan | Low | SA | 180.0 | 42.809000 | 44.823380 | 44.823380 | 44.823380 | 44.823380 | 45.365170 | 45.365170 | 45.365170 | 45.365170 | 45.365170 | 45.365170 | 45.365170 |
| 1 | 1 | AGO | Angola | Medium | SSA | 148.0 | NaN | NaN | NaN | NaN | NaN | 34.171440 | 34.171440 | 34.171440 | 34.171440 | 34.171440 | 34.171440 | 34.171440 |
| 2 | 2 | ALB | Albania | High | ECA | 67.0 | 11.900000 | 11.900000 | 11.900000 | 11.900000 | 11.900000 | 11.900000 | 11.900000 | 12.333440 | 12.333440 | 12.333440 | 12.333440 | 12.333440 |
| 3 | 3 | AND | Andorra | Very High | None | 40.0 | 15.160302 | 15.160302 | 15.160302 | 15.160302 | 9.965681 | 10.083815 | 10.008154 | 10.008154 | 10.008154 | 10.008154 | 10.008154 | 10.008154 |
| 4 | 4 | ARE | United Arab Emirates | Very High | AS | 26.0 | NaN | NaN | NaN | NaN | NaN | NaN | 18.241437 | 14.475335 | 12.634355 | 12.634355 | 12.634355 | 12.634355 |
| 5 | 5 | ARG | Argentina | Very High | LAC | 47.0 | 6.914310 | 6.830900 | 6.430560 | 6.513360 | 6.192280 | 6.331310 | 6.151940 | 6.208430 | 5.951810 | 5.787290 | 5.787290 | 5.787290 |
| 6 | 6 | ARM | Armenia | High | ECA | 85.0 | 3.685000 | 3.685000 | 3.685000 | 3.685000 | 3.685000 | 3.685000 | 2.934950 | 2.934950 | 2.934950 | 2.934950 | 2.934950 | 2.934950 |
| 7 | 7 | ATG | Antigua and Barbuda | High | LAC | 71.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | 8 | AUS | Australia | Very High | None | 5.0 | 2.746427 | 2.477821 | 2.475142 | 2.935539 | 2.832901 | 2.720964 | 2.672566 | 3.043233 | 3.056570 | 3.056570 | 3.056570 | 3.056570 |
| 9 | 9 | AUT | Austria | Very High | None | 25.0 | 2.488360 | 3.667240 | 3.489860 | 4.297670 | 2.658110 | 2.607560 | 2.570640 | 2.999600 | 2.917450 | 2.917450 | 2.522410 | 2.522410 |
The following guiding questions will help determine if there are significant relationships between certain quantitative factors and the happiness of countries:
Studies have found that obtaining an education, regardless of the level, increased one’s happiness as education tended to boost an individual’s self-esteem, and raise one’s potential income and chances of employment (Cunado & Gracia, 2011). Even though education typically increases one’s happiness regardless of the level one is enrolled in, researchers have reported a clear negative relationship between education inequality scores and happiness. This article does not specifically state that the discrepancy in education quality is the leading cause of unhappiness, but rather people tend to compare themselves with their more successful peers in areas like social status, income, and occupation. This ultimately leads to feelings of jealousy (Nikolaev, 2015). Therefore, can a country's education inequality level significantly impact the overall happiness of its citizens? Or do other factors play a more dominant role, making education inequality less influential to a country's overall happiness score?
Happiness, with a similar meaning as life satisfaction, is likely to be reflected by one’s quality of life, which comprises numerous influencing factors. Some studies have suggested the potential correlations between population-related factors and quality of life. For example, population density is found to decrease with the level of happiness, mainly due to the accompanied problems and inconveniences caused by densely populated environments, like overcrowding and traffic congestion (Belikow et al., 2021; Li & Kanazawa, 2016). Besides overall population size, the percentage of urban population is another parameter that affects a country’s happiness level, as urbanization generally tends to have a significant positive effect on an individual’s happiness (Lenzi & Perucca, 2022). While fertility is influenced by different individual and social factors, happiness has been found to be negatively associated with the number of children a woman gives birth to (Margolis & Myrskylä, 2011). Therefore, correlations seem to exist between happiness and population related factors, and we aim to identify them on a country basis using the selected datasets to determine if these factors can serve as robust indicators of a country’s happiness level.
This question investigates whether more populated countries show more academic inequality due to their population size or if education inequality exists due to other factors such as a nation’s improper allocation of resources to meet academic demand. Researchers have concluded that a person’s social class is a major predictor of academic success, especially during an individual’s earlier years (García & Weiss, 2017). Most of the time, the poor stay poor, while the rich get richer. Even though inequality within a country is inevitable, there are, however, some countries that have shown impressive strides in narrowing the education discrepancy among high- and low-income families. Finland, for example, is one of the many countries that reconstructed its education system where all children receive the same education stream until 16 years old, regardless of their income bracket, disabilities, or gender (Saunders, 2016).
Getting an education, no matter the level, is associated with increased happiness, improved self-esteem, and better employment and income prospects. However, studies reveal a negative correlation between education inequality and happiness. While education generally enhances happiness, the article suggests that the perceived discrepancy in success compared to peers, especially in areas like social status and income, can lead to feelings of jealousy. This idea prompts the question: Does a country's education inequality significantly impact overall happiness?
# Load the Excel file
# (DataForTable2.1WHR2023.xls is the happiness dataset. However, it was downloaded and
# obtained directly from The World Happiness Report website instead of Kaggle.)
data_df = pd.read_excel("DataForTable2.1WHR2023.xls")
# Initialize a list to store DataFrames
top_countries_data_list = []
# For each year, find the top ten countries based on "Life Ladder" and add to the list
for year in sorted(data_df['year'].unique()):
top_countries_year = data_df[data_df['year'] == year].nlargest(10, 'Life Ladder')
# Check if Canada is in the top countries for the year, if not add it
if 'Canada' not in top_countries_year['Country name'].values:
canada_data_year = data_df[(data_df['Country name'] == 'Canada') & (data_df['year'] == year)]
top_countries_year = pd.concat([top_countries_year, canada_data_year])
top_countries_data_list.append(top_countries_year)
# Concatenate all DataFrames in the list
top_countries_data = pd.concat(top_countries_data_list)
top_countries_data.head(5)
| Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 505 | Denmark | 2005 | 8.018934 | 10.849012 | 0.972372 | 68.300003 | 0.971135 | NaN | 0.236522 | 0.776689 | 0.153672 |
| 1380 | Netherlands | 2005 | 7.463979 | 10.809070 | 0.947358 | 70.699997 | 0.901008 | NaN | 0.571342 | 0.700739 | 0.232795 |
| 327 | Canada | 2005 | 7.418048 | 10.707299 | 0.961552 | 70.500000 | 0.957306 | 0.248033 | 0.502681 | 0.783299 | 0.233278 |
| 1849 | Sweden | 2005 | 7.376316 | 10.724154 | 0.951470 | 71.000000 | 0.964395 | NaN | NaN | 0.742480 | 0.150766 |
| 76 | Australia | 2005 | 7.340688 | 10.662058 | 0.967892 | 69.800003 | 0.934973 | NaN | 0.390416 | 0.769770 | 0.238012 |
We will first create a data frame concerning only the most "happiest" countries across multiple years. Canada will also be captured and measured against these countries. Doing this will identify which countries Canada can learn from in terms of improving their citizen's happiness.
from sqlalchemy import text
# update null value to 0 under columns in Inequality
update_statement = text("""
UPDATE inequality
SET
`Inequality in Education (2010)` = COALESCE(`Inequality in Education (2010)`, 0),
`Inequality in Education (2011)` = COALESCE(`Inequality in Education (2011)`, 0),
`Inequality in Education (2012)` = COALESCE(`Inequality in Education (2012)`, 0),
`Inequality in Education (2013)` = COALESCE(`Inequality in Education (2013)`, 0),
`Inequality in Education (2014)` = COALESCE(`Inequality in Education (2014)`, 0),
`Inequality in Education (2015)` = COALESCE(`Inequality in Education (2015)`, 0),
`Inequality in Education (2016)` = COALESCE(`Inequality in Education (2016)`, 0),
`Inequality in Education (2017)` = COALESCE(`Inequality in Education (2017)`, 0),
`Inequality in Education (2018)` = COALESCE(`Inequality in Education (2018)`, 0),
`Inequality in Education (2019)` = COALESCE(`Inequality in Education (2019)`, 0),
`Inequality in Education (2020)` = COALESCE(`Inequality in Education (2020)`, 0),
`Inequality in Education (2021)` = COALESCE(`Inequality in Education (2021)`, 0);
""")
with engine.connect() as connection:
# Use a transaction
with connection.begin() as transaction:
connection.execute(update_statement)
There appeared to be multiple NULL values captured in the Education Inequality dataset. Leaving these cell values the way they are will disturb query output later on in the project. Therefore, it is better to clean them now to diminish potential issues. To do this, we will change any NULL values appearing in the education inequality score fields to 0. This way, we can still aggregate these data points if needed.
query_2 = pd.read_sql_query("SELECT * FROM happy WHERE year = 2020 ORDER BY 'Life Ladder' DESC;", engine)
query_2.head(20)
| index | Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26 | Albania | 2020 | 5.364910 | 9.492215 | 0.710115 | 69.125000 | 0.753671 | 0.004123 | 0.891359 | 0.563486 | 0.265066 |
| 1 | 37 | Algeria | 2020 | 5.437755 | 9.291438 | 0.867649 | 66.500000 | 0.573891 | -0.121148 | 0.724264 | 0.524169 | 0.310630 |
| 2 | 57 | Argentina | 2020 | 5.900567 | 9.887623 | 0.897104 | 67.150002 | 0.823392 | -0.128546 | 0.815780 | 0.679317 | 0.342497 |
| 3 | 89 | Australia | 2020 | 7.137368 | 10.794416 | 0.936517 | 70.974998 | 0.905283 | 0.203258 | 0.491095 | 0.725689 | 0.205078 |
| 4 | 104 | Austria | 2020 | 7.213489 | 10.858776 | 0.924831 | 71.025002 | 0.911910 | 0.006125 | 0.463830 | 0.716352 | 0.206500 |
| 5 | 131 | Bahrain | 2020 | 6.173176 | 10.775782 | 0.847745 | 65.824997 | 0.945233 | 0.117025 | NaN | 0.729510 | 0.296835 |
| 6 | 146 | Bangladesh | 2020 | 5.279987 | 8.628981 | 0.739338 | 64.425003 | 0.777467 | -0.022516 | 0.741659 | 0.484560 | 0.331709 |
| 7 | 176 | Belgium | 2020 | 6.838761 | 10.799146 | 0.903559 | 70.750000 | 0.766918 | -0.170136 | 0.633627 | 0.619125 | 0.260189 |
| 8 | 192 | Benin | 2020 | 4.407746 | 8.066671 | 0.506636 | 55.799999 | 0.783115 | -0.082945 | 0.531884 | 0.556502 | 0.304512 |
| 9 | 212 | Bolivia | 2020 | 5.559259 | 8.946366 | 0.804811 | 63.450001 | 0.877032 | -0.052822 | 0.868208 | 0.728823 | 0.381791 |
| 10 | 227 | Bosnia and Herzegovina | 2020 | 5.515816 | 9.572131 | 0.898519 | 67.250000 | 0.740251 | 0.135483 | 0.916052 | 0.601871 | 0.325412 |
| 11 | 256 | Brazil | 2020 | 6.109718 | 9.548380 | 0.830832 | 65.574997 | 0.786235 | -0.057945 | 0.728772 | 0.653300 | 0.389139 |
| 12 | 270 | Bulgaria | 2020 | 5.597723 | 10.020541 | 0.916242 | 66.400002 | 0.818225 | -0.010107 | 0.900633 | 0.642388 | 0.221351 |
| 13 | 286 | Burkina Faso | 2020 | 4.639640 | 7.646704 | 0.667709 | 55.275002 | 0.750226 | 0.123013 | 0.808745 | 0.604512 | 0.388478 |
| 14 | 307 | Cambodia | 2020 | 4.376985 | 8.360816 | 0.724423 | 61.700001 | 0.963075 | 0.050262 | 0.863054 | 0.770770 | 0.389852 |
| 15 | 324 | Cameroon | 2020 | 5.241078 | 8.206723 | 0.720047 | 55.174999 | 0.674509 | 0.044859 | 0.836517 | 0.626285 | 0.386479 |
| 16 | 341 | Canada | 2020 | 7.024905 | 10.737793 | 0.930611 | 71.349998 | 0.886892 | 0.044788 | 0.434012 | 0.738485 | 0.306674 |
| 17 | 378 | Chile | 2020 | 6.150643 | 10.044019 | 0.888412 | 70.150002 | 0.781384 | 0.027615 | 0.811819 | 0.752885 | 0.336029 |
| 18 | 395 | China | 2020 | 5.771065 | 9.698723 | 0.808334 | 68.625000 | 0.891123 | -0.106377 | NaN | 0.662961 | 0.244918 |
| 19 | 411 | Colombia | 2020 | 5.709175 | 9.502091 | 0.797035 | 69.175003 | 0.840186 | -0.088326 | 0.807964 | 0.758543 | 0.340159 |
The SQL query above will retrieve 2020 information from the "happy" table. The outputted table will be ordered in descending order based on a country's happiness score (Life Ladder).
This query is useful as it helps identify which countries hold the highest happiness score, specifically during 2020. This information will be pivotal later on, as we will use it to compare it with education inequality score to see if there are any trends.
#-- example windowing function
query_5 = pd.read_sql_query("""
SELECT
RANK() OVER (PARTITION BY year ORDER BY `Life Ladder` DESC) AS country_rank,
`Country name`,
`Life Ladder`,
year
FROM happy;
""", engine)
query_5.head(10)
| country_rank | Country name | Life Ladder | year | |
|---|---|---|---|---|
| 0 | 1 | Denmark | 8.018934 | 2005 |
| 1 | 2 | Netherlands | 7.463979 | 2005 |
| 2 | 3 | Canada | 7.418048 | 2005 |
| 3 | 4 | Sweden | 7.376316 | 2005 |
| 4 | 5 | Australia | 7.340688 | 2005 |
| 5 | 6 | Belgium | 7.262290 | 2005 |
| 6 | 7 | Venezuela | 7.169621 | 2005 |
| 7 | 8 | Spain | 7.152786 | 2005 |
| 8 | 9 | France | 7.093393 | 2005 |
| 9 | 10 | Saudi Arabia | 7.079644 | 2005 |
Th query above uses information from the "happy" table. With the RANK() window function, we can assign each country a rank based on their happiness score for a certain year. Rankings will begin at 1, indicating the most happiest country, and will progress down accordingly. This function in conjunction with the PARTITION BY year clause will rank countries for each year separately. The output from this query will allow for comparative analysis where we can easily find the most happiest countries across multiple years.
query_11 = pd.read_sql_query("""
SELECT happy.`Country name`, happy.year, happy.`Life Ladder`, inequality.`Inequality in Education (2020)`
FROM happy
INNER JOIN inequality
ON happy.`Country name` LIKE inequality.Country || '%'
WHERE happy.year = 2020
ORDER BY inequality.`Inequality in Education (2020)` DESC;
""", engine)
We will now create a query that will investigate and attempt to answer our first guiding question, is there a correlation between a Country's happiness score and its education inequality score? To accomplish this task, the Happiness table and education inequality table will be joined using the inner join function. The join condition is based on a partial match, where the Country name from the happy table should start with the Country string from the inequality table (Inequality in Education around the World, n.d.). The query will retrieve information regarding country name, year, Life Ladder score, and Education Inequality score for 2020. Data will be filtered to present only records from 2020 and the results will be sorted based on life ladder score.
First, we will take a glimpse of the data to understand what information we are working with. Through the visualization, we can more adequately analyze the happiest countries based on their "Life Ladder" score across multiple years.
import pandas as pd
import plotly.express as px
# Create an animated horizontal bar plot
fig = px.bar(top_countries_data, y="Country name", x="Life Ladder", color="Country name",
animation_frame="year",
title="Top Ten Countries by Life Ladder Score Each Year (Including Canada)",
labels={"Country name": "Country", "Life Ladder": "Life Ladder Score"},
category_orders={"Country name": sorted(top_countries_data['Country name'].unique())},
orientation='h',
range_x=[6, top_countries_data['Life Ladder'].max() + 0.5])
# Adjusting the size of the figure (half of the default size)
fig.update_layout(width=1000, height=600)
# Slow down the animation speed (half the default speed)
fig.show(animation_options={"frame": {"duration": 10000}, "transition": {"duration": 500}})
# Export the plot as an HTML file
fig.write_html("animated_bar_plot.html")
The animated bar plot provides a dynamic and interactive way to observe changes in rankings over the years. This visualization will first give us a brief overview of how a country's happiness rank changed year-over-year. This visual also illustrates which countries are constantly performing well, noting them as key influencers whom others can learn from. Interestingly, we can also examine how Canada fares against other top ranking countries.
import plotly.graph_objects as go
testhappy = pd.read_sql_table('happy', engine)
latest_year = testhappy['year'].max()
top_10_countries_latest = testhappy[testhappy['year'] == latest_year].nlargest(10, 'Life Ladder')['Country name']
historical_data_top_10 = testhappy[testhappy['Country name'].isin(top_10_countries_latest)].copy()
historical_data_top_10['Rank'] = historical_data_top_10.groupby('year')['Life Ladder'].rank("dense", ascending=False)
pivot_data_top_10 = historical_data_top_10.pivot(index='Country name', columns='year', values='Rank')
fig_top_10 = go.Figure()
for country in pivot_data_top_10.index:
fig_top_10.add_trace(go.Scatter(
x=pivot_data_top_10.columns,
y=pivot_data_top_10.loc[country],
mode='lines+markers',
name=country,
line=dict(width=3) # Set the line width to 3
))
fig_top_10.update_layout(
title="Ranking of Top 10 Countries in Life Ladder Over Time",
xaxis_title="Year",
yaxis_title="Rank",
height=400, width=1000,
yaxis=dict(autorange="reversed"),
showlegend=True
)
fig_top_10.show()
fig.write_html("bumpchart.html")
We will use a bump chart, which is available in the plotly library, to identify the top 10 countries that have ranked the highest from 2004 to 2022. By analyzing the chart, we can observe that Denmark, Finland, and Norway have consistently outperformed other countries in terms of meeting the needs of their citizens over the past three years.
import pandas as pd
import plotly.express as px
# Filtering the dataset for the year 2020
data_2020 = query_11[query_11['year'] == 2020]
# Creating the map visualization
fig = px.choropleth(data_2020,
locations="Country name",
locationmode="country names",
color="Life Ladder",
hover_data=["Life Ladder", "Inequality in Education (2020)"], # Add "Life Ladder" and "Inequality in Education (2020)" to hover data
color_continuous_scale="Viridis",
title="Life Ladder Scores and Inequality in Education (2020) Around the World in 2020")
# Set the height and width of the figure
fig.update_layout(height=500, width=1000, margin=dict(l=0, r=0, t=0, b=0))
# Show the figure (in an interactive Python environment)
fig.show()
fig.write_html("happymap.html")
Through Plotly's map function, we are better able to determine which countries are the happiest. From the map, we can clearly see that the Northern part of the globe holds a higher average happiness score. This is, however, contrasted by the Southern part of the globe, which shows that those countries typically hold a lower average happiness score during 2020.
#select few columns from each table then merge two tables into one by column name where happy.`Country name` LIKE inequality.Country
query_10 = pd.read_sql_query("""
SELECT happy.`Country name`, happy.year, happy.`Life Ladder`, inequality.`Inequality in Education (2020)`
FROM happy
INNER JOIN inequality
ON happy.`Country name` LIKE inequality.Country || '%'
WHERE happy.year = 2020
ORDER BY inequality.`Inequality in Education (2020)` DESC;
""", engine)
query_10.head(20)
| Country name | year | Life Ladder | Inequality in Education (2020) | |
|---|---|---|---|---|
| 0 | Guinea | 2020 | 4.972168 | 50.124110 |
| 1 | Senegal | 2020 | 4.756773 | 47.135300 |
| 2 | Ivory Coast | 2020 | 5.256504 | 45.609040 |
| 3 | Mali | 2020 | 4.269474 | 43.859650 |
| 4 | Benin | 2020 | 4.407746 | 43.693550 |
| 5 | Pakistan | 2020 | 4.623969 | 43.527950 |
| 6 | Ethiopia | 2020 | 4.549220 | 42.770360 |
| 7 | Morocco | 2020 | 4.802618 | 41.870930 |
| 8 | Nepal | 2020 | 5.982410 | 41.099200 |
| 9 | Nigeria | 2020 | 5.502948 | 40.415070 |
| 10 | Burkina Faso | 2020 | 4.639640 | 39.152794 |
| 11 | Bangladesh | 2020 | 5.279987 | 37.311770 |
| 12 | Egypt | 2020 | 4.472397 | 36.942910 |
| 13 | India | 2020 | 4.223866 | 36.875730 |
| 14 | Ghana | 2020 | 5.319483 | 35.101670 |
| 15 | Algeria | 2020 | 5.437755 | 33.282620 |
| 16 | Cameroon | 2020 | 5.241078 | 31.716990 |
| 17 | Tunisia | 2020 | 4.730811 | 30.702850 |
| 18 | Iraq | 2020 | 4.785165 | 29.745910 |
| 19 | Uganda | 2020 | 4.640910 | 27.914380 |
This query highlights that countries with the lowest happiness scores are typically located farther south than Canada, often in African and Middle Eastern regions. From the table, the top 5 countries with the greatest education inequality score all reside from Africa.
import numpy as np
import matplotlib.pyplot as plt
# Create a smooth line
x = np.linspace(min(query_11['Inequality in Education (2020)']), max(query_11['Inequality in Education (2020)']), 100)
y = np.poly1d(np.polyfit(query_11['Inequality in Education (2020)'], query_11['Life Ladder'], 3))(x)
# Plot the scatter plot with smooth line
plt.scatter(query_11['Inequality in Education (2020)'], query_11['Life Ladder'], c=query_11['Life Ladder'], cmap='Oranges')
plt.plot(x, y, color='black', linewidth=2, linestyle='--') # Use dashed line
# Set labels and title
plt.xlabel('Inequality in Education (2020)')
plt.ylabel('Life Ladder')
plt.title('Correlation between Inequality in Education (2020) and Life Ladder')
# Add colorbar
plt.colorbar(label='Life Ladder')
# Show the plot
plt.show()
By using a scatter plot created with Plotly, we can observe the relationship between the Happiness score and the Inequality in Education score. The scatter plot indicates a clear negative correlation between these two variables. As the Inequality in Education score goes up, the Life Ladder (happiness) score goes down. Therefore, we can conclude that there is an inverse relationship between these two factors.
Education Inequality score and Happiness score appear to have a negative correlation. This is depicted as Education Inequality increases, and the Life ladder (happiness) score decreases. It is also evident that there is a significant geographical difference in terms of happiness and educational inequality across the world. Countries located in the northern hemisphere tend to have higher average happiness scores and lower levels of educational inequality. On the other hand, countries in the southern hemisphere seem to have comparatively lower happiness scores and higher education inequality scores for the year 2020. This geographical contrast highlights the differences in societal well-being and educational access, underlining the need for a deeper understanding of the root causes driving these disparities.
Happiness, which is closely related to life satisfaction, is influenced by various factors that contribute to one's overall quality of life. Some studies have indicated potential correlations between population-related factors and quality of life.
The provided evidence indicates that there may be links between happiness and population-related factors. We aim to investigate these connections at a country level by utilizing the chosen datasets. We will examine whether the factors mentioned earlier can be considered dependable indicators of a country's overall happiness level.
query_datatype = """
SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'population_db' OR TABLE_NAME = 'happiness_db';
"""
pd.read_sql_query(query_datatype, engine)
| COLUMN_NAME | DATA_TYPE | |
|---|---|---|
| 0 | index | bigint |
| 1 | Country name | text |
| 2 | Regional indicator | text |
| 3 | Ladder score | double |
| 4 | Standard error of ladder score | double |
| 5 | upperwhisker | double |
| 6 | lowerwhisker | double |
| 7 | Logged GDP per capita | double |
| 8 | Social support | double |
| 9 | Healthy life expectancy | double |
| 10 | Freedom to make life choices | double |
| 11 | Generosity | double |
| 12 | Perceptions of corruption | double |
| 13 | Ladder score in Dystopia | double |
| 14 | Explained by: Log GDP per capita | double |
| 15 | Explained by: Social support | double |
| 16 | Explained by: Healthy life expectancy | double |
| 17 | Explained by: Freedom to make life choices | double |
| 18 | Explained by: Generosity | double |
| 19 | Explained by: Perceptions of corruption | double |
| 20 | Dystopia + residual | double |
| 21 | index | bigint |
| 22 | Country | text |
| 23 | Population | bigint |
| 24 | Growth | double |
| 25 | Net Change | bigint |
| 26 | Density | bigint |
| 27 | Area | bigint |
| 28 | Migrants | double |
| 29 | Fertility Rate | double |
| 30 | Median Age | double |
| 31 | Urban Population | double |
| 32 | World Share | double |
The above query is performed to display the datatype of each column in both the happiness and population datasets. From the above table, there is a part that requires data wrangling to change the data type. It is noticed that the Growth, urban population and world share columns are in datatype of text because the values in these columns contain a percentage sign. For the sake of using the data from these columns for correlation analysis, the datatype has to be changed to double, which is a kind of float.
value_setting_statement = sq.text("""
UPDATE population_db
SET
Growth = REPLACE(Growth, '%', ''),
`Urban Population` = REPLACE(`Urban Population`, '%', ''),
`World Share` = REPLACE(`World Share`, '%', '');
""")
with engine.connect() as connection:
connection.execute(value_setting_statement)
pd.read_sql_table("population_db", engine)
| index | Country | Population | Growth | Net Change | Density | Area | Migrants | Fertility Rate | Median Age | Urban Population | World Share | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | China | 1440297825 | 0.39 | 5540090 | 153 | 9388211 | -348399.0 | 1.7 | 38.0 | 61.0 | 18.47 |
| 1 | 1 | India | 1382345085 | 0.99 | 13586631 | 464 | 2973190 | -532687.0 | 2.2 | 28.0 | 35.0 | 17.70 |
| 2 | 2 | United States | 331341050 | 0.59 | 1937734 | 36 | 9147420 | 954806.0 | 1.8 | 38.0 | 83.0 | 4.25 |
| 3 | 3 | Indonesia | 274021604 | 1.07 | 2898047 | 151 | 1811570 | -98955.0 | 2.3 | 30.0 | 56.0 | 3.51 |
| 4 | 4 | Pakistan | 221612785 | 2.00 | 4327022 | 287 | 770880 | -233379.0 | 3.6 | 23.0 | 35.0 | 2.83 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 230 | 230 | Montserrat | 4993 | 0.06 | 3 | 50 | 100 | NaN | NaN | NaN | 10.0 | 0.00 |
| 231 | 231 | Falkland Islands | 3497 | 3.05 | 103 | 0 | 12170 | NaN | NaN | NaN | 66.0 | 0.00 |
| 232 | 232 | Niue | 1628 | 0.68 | 11 | 6 | 260 | NaN | NaN | NaN | 46.0 | 0.00 |
| 233 | 233 | Tokelau | 1360 | 1.27 | 17 | 136 | 10 | NaN | NaN | NaN | 0.0 | 0.00 |
| 234 | 234 | Holy See | 801 | 0.25 | 2 | 2003 | 0 | NaN | NaN | NaN | NaN | 0.00 |
235 rows × 12 columns
This query removes percentage signs from the 'Growth,' 'Urban Population,' and 'World Share' columns. The objective is to convert these columns from string representations of percentages to numeric values, facilitating analysis.
datatype_modifying_statement = sq.text("""
ALTER TABLE population_db
MODIFY COLUMN Growth DOUBLE,
MODIFY COLUMN `Urban Population` DOUBLE,
MODIFY COLUMN `World Share` DOUBLE;
""")
with engine.connect() as connection:
connection.execute(datatype_modifying_statement)
This query changes the data type of the 'Growth,' 'Urban Population,' and 'World Share' columns from their previous data type to double.
query_datatype = """
SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'population_db' OR TABLE_NAME = 'happiness_db';
"""
pd.read_sql_query(query_datatype, engine)
| COLUMN_NAME | DATA_TYPE | |
|---|---|---|
| 0 | index | bigint |
| 1 | Country name | text |
| 2 | Regional indicator | text |
| 3 | Ladder score | double |
| 4 | Standard error of ladder score | double |
| 5 | upperwhisker | double |
| 6 | lowerwhisker | double |
| 7 | Logged GDP per capita | double |
| 8 | Social support | double |
| 9 | Healthy life expectancy | double |
| 10 | Freedom to make life choices | double |
| 11 | Generosity | double |
| 12 | Perceptions of corruption | double |
| 13 | Ladder score in Dystopia | double |
| 14 | Explained by: Log GDP per capita | double |
| 15 | Explained by: Social support | double |
| 16 | Explained by: Healthy life expectancy | double |
| 17 | Explained by: Freedom to make life choices | double |
| 18 | Explained by: Generosity | double |
| 19 | Explained by: Perceptions of corruption | double |
| 20 | Dystopia + residual | double |
| 21 | index | bigint |
| 22 | Country | text |
| 23 | Population | bigint |
| 24 | Growth | double |
| 25 | Net Change | bigint |
| 26 | Density | bigint |
| 27 | Area | bigint |
| 28 | Migrants | double |
| 29 | Fertility Rate | double |
| 30 | Median Age | double |
| 31 | Urban Population | double |
| 32 | World Share | double |
Now, the datatype of all columns looks to be correct and the data is ready for use.
population_db = pd.read_sql_table("population_db", engine)
population_db.isna().sum()
index 0 Country 0 Population 0 Growth 0 Net Change 0 Density 0 Area 0 Migrants 34 Fertility Rate 34 Median Age 34 Urban Population 13 World Share 0 dtype: int64
In the selected dataset, certain columns such as "Migrants", "Fertility Rate", "Median Age", and "Urban Population" contain null values. It is important to note that the "Migrants" column is not utilized in this project.
query_join1 = """
SELECT `Country name`, `Regional indicator`, Population, Growth, Density, `Fertility Rate`, `Median Age`, `Urban Population`, `Ladder score`
FROM happiness_db INNER JOIN population_db
ON happiness_db.`Country name` LIKE CONCAT(population_db.Country, '%')
GROUP BY `Country name`
ORDER BY `Ladder score` DESC;
"""
joinedtable = pd.read_sql_query(query_join1, engine)
joinedtable
| Country name | Regional indicator | Population | Growth | Density | Fertility Rate | Median Age | Urban Population | Ladder score | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Finland | Western Europe | 5542237 | 0.15 | 18 | 1.5 | 43.0 | 86.0 | 7.8087 |
| 1 | Denmark | Western Europe | 5795780 | 0.35 | 137 | 1.8 | 42.0 | 88.0 | 7.6456 |
| 2 | Switzerland | Western Europe | 8665615 | 0.74 | 219 | 1.5 | 43.0 | 74.0 | 7.5599 |
| 3 | Iceland | Western Europe | 341628 | 0.65 | 3 | 1.8 | 37.0 | 94.0 | 7.5045 |
| 4 | Norway | Western Europe | 5428594 | 0.79 | 15 | 1.7 | 40.0 | 83.0 | 7.4880 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 141 | Central African Republic | Sub-Saharan Africa | 4843954 | 1.78 | 8 | 4.8 | 18.0 | 43.0 | 3.4759 |
| 142 | Rwanda | Sub-Saharan Africa | 13005303 | 2.58 | 525 | 4.1 | 20.0 | 18.0 | 3.3123 |
| 143 | Zimbabwe | Sub-Saharan Africa | 14899771 | 1.48 | 38 | 3.6 | 19.0 | 38.0 | 3.2992 |
| 144 | South Sudan | Sub-Saharan Africa | 11216250 | 1.19 | 18 | 4.7 | 19.0 | 25.0 | 2.8166 |
| 145 | Afghanistan | South Asia | 39074280 | 2.33 | 60 | 4.6 | 18.0 | 25.0 | 2.5669 |
146 rows × 9 columns
The population database and the happiness database are merged using the country name as the key column. An inner join is performed to ensure that only countries with data in both databases are included. Since the country names in the two databases may have variations, such as short forms or special characters, the LIKE operator is used to match similar country names. The resulting joined table is used for analysis and exploration in the following.
joinedtable.isna().sum()
Country name 0 Regional indicator 0 Population 0 Growth 0 Density 0 Fertility Rate 0 Median Age 0 Urban Population 4 Ladder score 0 dtype: int64
Upon merging the happiness dataset to create the joined table, it is observed that only one column ("Urban Population") contains null values. Instead of removing these rows due to the null values, a decision is made to keep them. The rationale behind this decision is that the number of rows affected by null values is minimal (~2%), specifically limited to four values.
In addition to the limited impact, the choice to keep these rows is reinforced by the recognition that the remaining columns in these specific rows still contain valuable and meaningful information. Because of this, it is deemed more beneficial to keep all the rows in the joinedtable, including the rows with null values, rather than removing them.
import plotly.express as px
# Data preparation
correlation_data = joinedtable[['Population', 'Growth', 'Density', 'Fertility Rate', 'Median Age', 'Urban Population', 'Ladder score']]
correlation_matrix = correlation_data.corr() # Calculate the correlation matrix
# Creating the heatmap with values displayed
fig = px.imshow(correlation_matrix,
x = correlation_matrix.columns,
y = correlation_matrix.columns,
text_auto = '.2f') # Automatically add text in each cell
fig.update_layout(title = "Correlation Heatmap of Happiness Ladder Score and Population Related Parameters",
height = 600,
width = 1000,
paper_bgcolor='black', # Background color
font_color='white', # General text color
title_font_color='white') # Title text color
fig.show()
This heatmap illustrates the relationships among various variables, such as population, population growth, density, fertility rate, median age, percentage of urban population, and happiness ladder score. Notably, the ladder score exhibits robust associations with median age, urban population, and fertility rate, as evidenced by high correlation coefficients exceeding 0.6. Additionally, a significant link exists between ladder score and population growth. However, given the high inter-correlations between growth, median age, and fertility rate, and the comparatively weaker correlation with ladder score, we decide to exclude the growth parameter to mitigate the impact of multicollinearity among independent variables.
While considerable intercorrelations persist between median age, fertility rate, and urban population percentage, these variables also share similarly high correlations with the happiness score. Consequently, a more in-depth analysis of these variables is needed to explore the most reliable indicator of happiness level.
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Ladder Score vs Fertility Rate
fig1 = px.scatter(joinedtable,
x = 'Fertility Rate',
y = 'Ladder score',
trendline = 'ols', # Add a trendline
color_discrete_sequence = ['skyblue'], # Set color of dots
trendline_color_override = 'lightblue') # Set color of trendline
# Ladder Score vs Median Age
fig2 = px.scatter(joinedtable,
x = 'Median Age',
y = 'Ladder score',
trendline = 'ols', # Add a trendline
color_discrete_sequence=['lightcoral'], # Set color of dots
trendline_color_override = 'lightpink') # Set color of trendline
# Ladder Score vs Urban Population
fig3 = px.scatter(joinedtable,
x = 'Urban Population',
y = 'Ladder score',
trendline = 'ols', # Add a trendline
color_discrete_sequence=['lightseagreen'], # Set color of dots
trendline_color_override = 'lightgreen') # Set color of trendline
# Create subplots
fig = make_subplots(rows = 1, cols = 3, subplot_titles=("Ladder Score vs Fertility Rate",
"Ladder Score vs Median Age",
"Ladder Score vs Urban Population"))
# Add the scatter plots to the subplots
for data in fig1.data:
fig.add_trace(data, row = 1, col = 1)
for data in fig2.data:
fig.add_trace(data, row = 1, col = 2)
for data in fig3.data:
fig.add_trace(data, row = 1, col = 3)
# Update x-axis and y-axis titles for each subplot
fig.update_yaxes(row = 1, col = 1, title_text = "Ladder Score")
fig.update_xaxes(row = 1, col = 1, title_text = "Fertility Rate (%)")
fig.update_xaxes(row = 1, col = 2, title_text = "Median Age")
fig.update_xaxes(row = 1, col = 3, title_text = "Urban Population (%)")
# Update layout
fig.update_layout(height = 400,
width = 1000,
paper_bgcolor = 'black', # Background color
font_color= 'white', # General font color
plot_bgcolor = 'black', # Plot background color
title_text = "Scatterplots of Happiness Score vs Potential Influencing Factors")
fig.update_traces(marker = dict(size = 8, line = dict(width = 1.5, color = 'black')),
selector =dict(mode = 'markers')) # Set marker size and color
fig.show()
We generate scatterplots to explore the relationships between the happiness score and the three identified factors. Our observations reveal noticeable correlations in all three plots, where the ladder score demonstrates a negative association with fertility rate and a positive association with both median age and the percentage of urban population. Upon closer examination of individual plots, a concentration of data points is noticed at low fertility rates, forming a curve-like distribution. Consequently, when compared with the other two variables, fertility rate appears less optimal as an indicator.
In contrast, both median age and urban population percentage exhibit similar distributions, with data points evenly dispersed along the fitted line. Consequently, both factors appear suitable as indicators of happiness level. However, considering the high intercorrelation between these two factors, we should choose either one. Based on correlation coefficients, median age slightly outperforms urban population percentage, being the most favorable indicator of happiness level among all population-related parameters. The strong correlation between happiness and median age may be attributed to the idea that happier individuals tend to live longer, and hence result in higher median age of people in happier countries.
In conclusion, this analysis provides valuable insights into the dataset and highlights the relationships between happiness level and different population-related variables. We found negative correlations with fertility rate and positive correlations with median age and urban population. While both median age and urban population are suitable happiness indicators, due to their high intercorrelation, median age is considered as a more reliable happiness level indicator among population-related factors. The findings can be used to further explore and understand the factors influencing happiness levels in the population.
This inquiry aims to determine whether countries with larger populations exhibit more academic inequality due to their size or if educational inequality arises from other factors, such as a country's improper allocation of resources to meet academic demand. Studies have shown that an individual's social class significantly impacts their academic success, particularly during their formative years (García & Weiss, 2017). Unfortunately, it is often the case that those who start with less tend to stay behind, while those who start with more tend to keep advancing. Despite the inevitability of inequality within a country, some have made progress in narrowing the education gap among high and low-income families. One such example is Finland, which overhauled its education system, offering students the same education stream until the age of 16, regardless of their income, disability, or gender (Saunders, 2016).
query = """
SELECT h.`Country name`,
e.`Human Development Groups`,
e.`Inequality in Education (2020)`,
p.`Density (P/Km²)`
FROM happiness h
INNER JOIN education e ON h.`Country name` = e.`Country`
INNER JOIN population2020 p ON h.`Country name` = p.`Country (or dependency)`
"""
table_q1 = pd.read_sql_query(query, engine)
table_q1
| Country name | Human Development Groups | Inequality in Education (2020) | Density (P/Km²) | |
|---|---|---|---|---|
| 0 | China | High | 11.680000 | 153 |
| 1 | India | Medium | 36.875730 | 464 |
| 2 | United States | Very High | 2.745850 | 36 |
| 3 | Indonesia | High | 17.250017 | 151 |
| 4 | Pakistan | Low | 43.527950 | 287 |
| ... | ... | ... | ... | ... |
| 134 | Montenegro | Very High | 7.815010 | 47 |
| 135 | Luxembourg | Very High | 4.717070 | 242 |
| 136 | Maldives | High | 29.319480 | 1802 |
| 137 | Malta | Very High | 5.192380 | 1380 |
| 138 | Iceland | Very High | 2.200770 | 3 |
139 rows × 4 columns
This query divides the data between different Human Development Groups. This was done to fairly compare countries undergoing similar Human Development Index attributes (i.e., life expectancy, education, and per capita income). Population density was used in favor to population as it provides a more general sense of population in a given area. The following table will be used to determine if a more populated country in regards to density affects education inequality.
# check for unnecssary null values
table_q1[table_q1.isna().any(axis=1)]
| Country name | Human Development Groups | Inequality in Education (2020) | Density (P/Km²) | |
|---|---|---|---|---|
| 92 | Libya | High | NaN | 4 |
| 129 | Trinidad and Tobago | Very High | NaN | 273 |
The code above will check for any unnecssary null values.
# remove null rows
final1 = table_q1.dropna(how='any',axis=0)
final1 = final1.reset_index()
final1.head(10)
| index | Country name | Human Development Groups | Inequality in Education (2020) | Density (P/Km²) | |
|---|---|---|---|---|---|
| 0 | 0 | China | High | 11.680000 | 153 |
| 1 | 1 | India | Medium | 36.875730 | 464 |
| 2 | 2 | United States | Very High | 2.745850 | 36 |
| 3 | 3 | Indonesia | High | 17.250017 | 151 |
| 4 | 4 | Pakistan | Low | 43.527950 | 287 |
| 5 | 5 | Brazil | High | 15.652570 | 25 |
| 6 | 6 | Nigeria | Low | 40.415070 | 226 |
| 7 | 7 | Bangladesh | Medium | 37.311770 | 1265 |
| 8 | 8 | Mexico | High | 13.487955 | 66 |
| 9 | 9 | Japan | Very High | 4.511000 | 347 |
Since there were instances of null values in the table, we will use the code above to remove any null value data instances, while reseting the index.
query = """
SELECT
h.`Regional indicator`,
AVG(e.`Inequality in Education (2020)`) AS avg_inequality_2020
FROM
happiness h
INNER JOIN
education e ON h.`Country name` = e.`Country`
GROUP BY
h.`Regional indicator`;
"""
final2 = pd.read_sql_query(query, engine)
# Produce final table to use in visualization
final2
| Regional indicator | avg_inequality_2020 | |
|---|---|---|
| 0 | Central and Eastern Europe | 5.478769 |
| 1 | Commonwealth of Independent States | 3.117286 |
| 2 | East Asia | 9.227668 |
| 3 | Latin America and Caribbean | 16.096977 |
| 4 | Middle East and North Africa | 22.042057 |
| 5 | North America and ANZ | 2.540888 |
| 6 | South Asia | 35.070890 |
| 7 | Southeast Asia | 17.202814 |
| 8 | Sub-Saharan Africa | 34.618264 |
| 9 | Western Europe | 5.761916 |
This query seeks to identify the regions with the highest education inequality. Furthermore, the query's output serves as a statistical benchmark to help countries determine how they compare to their region's average education inequality score. The hope is that countries with higher education inequality scores located in more equitable regions will be encouraged to assess and implement more equitable infrastructure to reduce their education inequality score.
The table above seems to not have any null values, therefore no additional measures need to be done to remove null data instances.
import plotly.express as px
# Create Scatter plot:
fig = px.scatter(final1, y="Inequality in Education (2020)", x="Density (P/Km²)",
color="Human Development Groups", trendline="ols", labels={
"Inequality in Education (2020)": "Education Inequality (2020)",
"Density (P/Km²)": "Population Density (P/Km²)"
},
title="Population Density vs Education Inequality for 2020")
# update plot size:
fig.update_layout(
autosize=False,
width=1200,
height=500,)
fig.update_xaxes(range = [-75,2000])
# show figure:
fig.show()
From the scatterplot above, the trend lines for medium, high, and very high human development groups indicate that as the density population increases, so does education inequality. This notion is consistent with a Harvard study that examines the relationship of population density to educational inequality in the United States. In this report, researchers found that communities with denser population levels had a higher socioeconomic gap, which contributed to the increase in inequality in the area (Carlos, n.d.). Although most of the data supports the previous study, low human development countries show a negative relationship between population density and education inequality. This means that as the population density increases, education inequality decreases. One of the possible reasons behind this result is that urbanization has positive effects on rural-like environments. As the population density increases, there is more access to resources and institutions including healthcare, transportation, and education. This suggests that more access to educational resources in rural areas can reduce education inequality (Bloom & Khanna, 2007).
import plotly.express as px
# Create bar chart:
fig1 = px.bar(final2, x="Regional indicator", y="avg_inequality_2020", labels={
"avg_inequality_2020": "Average Education Inequality Score (2020)",
"Regional indicator": "Region"
},
title="Average Education Inequality Score by Regional in 2020", color = "Regional indicator")
# update plot size:
fig1.update_layout(
autosize=False,
width=1000,
height=500,)
# remove legend as it was not necessary since it was the same as x-axis labels
fig1.update_layout(showlegend=False)
# show figure:
fig1.show()
Inequality is present in every society, and it can manifest in various forms such as gender, age, ethnicity, origin, disability, or social class disparities (United Nations, n.d.). When it comes to education inequality, based on observations of several countries' education systems, three main factors seem to affect the dependent variable.
The bar graph above displays the differences in education inequality among various regions in the world during 2020. Some regions have relatively lower average education inequality scores than others. For instance, the North America and ANZ region scored the lowest with an average education inequality score of 2.54, whereas regions such as South Asia and Sub-Saharan Africa recorded higher-than-average scores of over 30. This variation in education inequality could be attributed to the different types of inequality present and the extent to which they exist within each region. Mismanagement of the above factors could lead to a rise in education inequality, which can negatively impact a country's education equality system.
Other reasons behind a specific region's education inequality score:
Although countries like Switzerland and Finland are considered to be happier and have a lower education inequality score compared to the USA, the overall average of Western Europe, to which they belong, is still higher than its counterpart. This is because some other countries in the region, like Spain, exhibit greater inequality due to the effects of socioeconomic origin and status. Low socioeconomic status often leads to poor academic performance within the country (Mendizabal, 2018).
Over the years, many countries have made significant improvements to reduce education inequality. Although some countries still discriminate against women's education rights, most countries have made efforts to expand their education systems. For instance, in the mid-2000s, women in China gained more opportunities to receive education. As a result, gender equality in the region improved over time, and the gap in college enrollment was even reversed to favor women in 2005 (Wu & Zhang, 2010).
In terms of education inequality, East Asia has a slightly higher score compared to Western countries. In addition to social class, this fact can be attributed to the competitive and rigorous academic structure in Eastern countries, where good grades are essential for academic and professional advancement. In China, for instance, students have to pass entrance exams to gain admission to prestigious high schools and universities in the future. This competitive environment impacts students' mental health, which leads some students to cheat on their exams for a better life (Fang, 2019).
In conclusion, the analysis of education inequality across different human development groups and regions in 2020 has highlighted significant disparities and multifaceted causes. While increasing population density corresponds to higher education inequality in medium, high, and very-high-development countries, low-human-development countries present a counterintuitive negative relationship, suggesting that increased population density can reduce education inequality by enhancing access to resources in rural environments. Thus, it is crucial to consider regional dynamics and urbanization effects in addressing global education disparities.
In addition to other factors, social class, origin-based discrimination, and gender-related discrimination all play a significant role in creating education inequality. North America, Australia, and New Zealand have been successful in managing these factors, resulting in a fair distribution of education compared to other parts of the world.
Team Member 1 - Bo Li:
Team Member 2 - Brian Ho:
Team Member 3 - Ethan Mah:
Team Member 1 - Bo Li:
Team Member 2 - Brian Ho:
Team Member 3 - Ethan Mah:
Team Member 2 - Brian Ho:
In reference to our guiding questions, we found a negative relationship between a country's happiness score and education inequality, in which as education inequality increased, the happiness score decreased. Our study also investigated population-related factors that influence happiness, finding that median age is the most reliable indicator to determine a person's happiness. In regards to geographical locations, countries in the northern part of the world display a tendency to be happier than their southern counterparts. In addition, more densely populated countries tend to possess greater education inequality scores. This statement is true except for developing countries with low human development index where urbanization has reduced education inequality.
Our analysis was meant to show, through breaking down survey information and large statistical datasets, how a country's happiness score can be influenced by factors beyond what was recorded in The World Happiness Report. Our specific datasets provided invaluable insights into the key demographic indicators and optimal geographical locations that have the greatest effect on one's happiness. While more work can still be done to create a more comprehensive analysis, our current findings on the subject can still prove to be useful for policymakers and individuals alike who strive for a happier life.
Belikow, A., DeWeese, J., Ravensbergen, L., Kestens, Y., & El-Geneidy, A. (2021). Are we happy in densely populated environments? assessing the impacts of density on subjective well-being, quality of life, and perceived health in Montreal, Canada. Findings. https://doi.org/10.32866/001c.23718
Bloon, D., & Khanna, T. (2007, September). The urban revolution. Retrieved from https://www.imf.org/external/pubs/ft/fandd/2007/09/bloom.htm
Carlos. (n.d.). Population density and educational inequality: the role of public school choice and accountability. Retrieved from https://scholar.harvard.edu/files/clastraanadon/files/lastra_density18.pdf
Cuñado, J., & Gracia, F. (2011, June 03). Does education affect happiness? Retrieved from https://link.springer.com/article/10.1007/s11205-011-9874-x#citeas
Fang, A. (2019, December 21). Chinese education system versus U.S education system. Retrieved from https://blog.tutorabcchinese.com/chinese-culture/chinese-education-system-vs-us-education-system
Garcia, E., & Weiss, E. (2017, September 27). Education inequalities at the school starting gate. Retrieved from https://www.epi.org/publication/education-inequalities-at-the-school-starting-gate/
Lenzi, C., & Perucca, G. (2022). No place for poor men: On the asymmetric effect of urbanization on Life Satisfaction. Social Indicators Research, 164(1), 165–187. https://doi.org/10.1007/s11205-022-02946-1
Li, N. P., & Kanazawa, S. (2016). Country roads, take me home… to My Friends: How Intelligence, population density, and friendship affect modern happiness. British Journal of Psychology, 107(4), 675–697. https://doi.org/10.1111/bjop.12181
Margolis, R., & Myrskylä, M. (2011). A global perspective on happiness and fertility. Population and Development Review, 37(1), 29–56. https://doi.org/10.1111/j.1728-4457.2011.00389.x
Mendizabal, A. (2018, December). Socioeconomic inequalities and academic achievement in Spain. Retrieved from https://elobservatoriosocial.fundacionlacaixa.org/en/-/desigualdades-socioeconomicas-y-rendimiento-academico
Nikolaev, B. (2016, January 26). Does other people’s education make us less happy? Retrieved from https://www.sciencedirect.com/science/article/pii/S0272775715300194#preview-section-cited-by
Saunders, D. (2016, April 22). Finland’s social climbers: how they’re fighting inequality with education, and winning. Retrieved from https://www.theglobeandmail.com/news/national/education/how-finland-is-fighting-inequality-with-education-andwinning/article29716845/
Teahcing about educational inequality. (n.d.). Retrieved from https://educators4sc.org/teaching-about-educational-inequality/
United Nations. (n.d.). Inequality - bridging the divide. Retrieved from https://www.un.org/en/un75/inequality-bridging-divide
Wiher, A. (2020). Education inequality in different countries. Retrieved from https://wp-prd.let.ethz.ch/WP0-CIPRF91243/chapter/education-inequality-in-different-countries-2/
Wu, X., & Zhang, Z. (2010, May 11). Changes in educational inequality in China. Retrieved from https://www.emerald.com/insight/content/doi/10.1108/S1479-3539%282010%290000017007/full/html